DSCI 320 Milestone II: Code & Dashboard

By Selena Shew, Cassandra Zhang, Jamie Jiang

Group CSJ

December 2, 2023

Data Cleaning

In [1]:
#first I will read in the data
import pandas as pd

airbnb = pd.read_csv("airbnb.csv", parse_dates=['host_since'])
In [2]:
airbnb.head()
Out[2]:
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id ... review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 13188.0 https://www.airbnb.com/rooms/13188 2.023090e+13 2023-09-06 city scrape Rental unit in Vancouver · ★4.83 · Studio · 2 ... Garden level studio suite with garden patio - ... The uber hip Main street area is a short walk ... https://a0.muscache.com/pictures/8408188/e1af6... 51466 ... 4.92 4.88 4.80 23-156488 f 2 2 0 0 1.68
1 13358.0 https://www.airbnb.com/rooms/13358 2.023090e+13 2023-09-06 city scrape Condo in Vancouver · ★4.68 · 1 bedroom · 1 bed... <b>The space</b><br />This suites central loca... NaN https://a0.muscache.com/pictures/40034c18-0837... 52116 ... 4.79 4.92 4.65 22-311727 f 1 1 0 0 2.96
2 13490.0 https://www.airbnb.com/rooms/13490 2.023090e+13 2023-09-06 city scrape Rental unit in Vancouver · ★4.92 · 1 bedroom ·... This apartment rents for one month blocks of t... In the heart of Vancouver, this apartment has ... https://a0.muscache.com/pictures/73394727/79d5... 52467 ... 4.97 4.79 4.89 NaN f 1 1 0 0 0.66
3 14267.0 https://www.airbnb.com/rooms/14267 2.023090e+13 2023-09-06 city scrape Home in Vancouver · ★4.76 · 1 bedroom · 2 beds... The Ecoloft is located in the lovely, family r... We live in the centre of the city of Vancouver... https://a0.muscache.com/pictures/3646de9b-934e... 56030 ... 4.68 4.77 4.71 21-156500 t 1 1 0 0 0.22
4 14424.0 https://www.airbnb.com/rooms/14424 2.023090e+13 2023-09-06 city scrape Guest suite in Vancouver · ★4.69 · 1 bedroom ·... <b>The space</b><br />Welcome to Strathcona --... NaN https://a0.muscache.com/pictures/miso/Hosting-... 56709 ... 4.72 4.60 4.73 19-162091 f 4 4 0 0 1.63

5 rows × 74 columns

In [3]:
#Then I will clean up the data

#I need to drop all of the columns we won't be using
#I need to filter and rename the property types for simplicity
#I will also rename the room types and superhost designation for ease of understanding
#I will need to remove our outliers (we have two data points with a daily rate bigger than $3,000 while everything else is cheaper)
#I also need to drop any rows with missing (NA) values

#keep only the columns we want to examine
airbnb_cleaned = airbnb[['accommodates','price', 'bathrooms', 'beds', 'number_of_reviews', 'neighbourhood_cleansed',
                        'property_type', 'host_is_superhost', 'review_scores_rating', 'room_type', 
                         'host_response_time', 'host_since', 'latitude', 'longitude']]

#filter to keep only the main property types and rename
keep_prop_types = ['Entire condo', 'Entire rental unit', 'Entire guest suite', 'Entire home', 'Entire townhouse',
             'Private room in condo', 'Private room in home', 'Private room in rental unit',
             'Private room in guest suite', 'Private room in townhouse']

airbnb_cleaned = airbnb_cleaned.query(
    'property_type == @keep_prop_types'
).replace(
     {'Entire condo': 'Condo',
      'Private room in condo': 'Condo',
      'Entire rental unit': 'Rental Suite',
      'Private room in rental unit': 'Rental Suite',
      'Entire guest suite': 'Guest Suite',
      'Private room in guest suite': 'Guest Suite',
      'Entire home': 'House',
      'Private room in home': 'House',
      'Entire townhouse': 'Townhouse',
      'Private room in townhouse': 'Townhouse'
    }
)

#clean the room types & superhost designation
airbnb_cleaned = airbnb_cleaned.replace(
    {"Entire home/apt": 'Entire place',
    't': 'True',
    'f': 'False'}
)

#remove the dollar sign and thousands separators from the price column
#(regex=False treats ',' and '$' as literal characters)
airbnb_cleaned['price'] = airbnb_cleaned['price'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)

#filter out the outliers: listings where price is greater than $3,000
airbnb_cleaned = airbnb_cleaned.query('price <= 3000.0')

#finally, I'll drop all rows with missing values
airbnb_cleaned = airbnb_cleaned.dropna()

airbnb_cleaned.head()
Out[3]:
   accommodates  price  bathrooms  beds  number_of_reviews    neighbourhood_cleansed  property_type  host_is_superhost  review_scores_rating     room_type  host_response_time  host_since  latitude   longitude
0             4  151.0          1   2.0                277                Riley Park   Rental Suite               True                  4.83  Entire place      within an hour  2009-11-04  49.24773  -123.10509
1             2  215.0          1   1.0                476                  West End          Condo              False                  4.68  Entire place      within an hour  2009-11-07  49.28201  -123.12669
2             2  150.0          1   1.0                 99  Kensington-Cedar Cottage   Rental Suite               True                  4.92  Entire place      within an hour  2009-11-08  49.25622  -123.06607
4             2  135.0          1   1.0                269         Downtown Eastside    Guest Suite              False                  4.69  Entire place  within a few hours  2009-11-23  49.27921  -123.08835
6             6  100.0          1   4.0                  3        Grandview-Woodland          House              False                  4.00  Entire place  a few days or more  2009-11-29  49.26339  -123.07145
In [4]:
airbnb_cleaned.shape
Out[4]:
(4286, 14)
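
The price-cleaning and outlier steps in the cleaning cell can be checked on a toy Series before running them on the full data. This is only a minimal sketch, assuming price strings formatted like "$1,234.00"; the sample values and the names raw_prices and PRICE_CAP are made up for illustration and do not come from airbnb.csv.

```python
import pandas as pd

# Hypothetical sample values, not taken from airbnb.csv
raw_prices = pd.Series(["$151.00", "$1,250.00", "$8,000.00"])

# regex=False makes ',' and '$' literal characters rather than patterns
prices = (
    raw_prices
    .str.replace(",", "", regex=False)
    .str.replace("$", "", regex=False)
    .astype(float)
)

PRICE_CAP = 3000.0  # same cutoff as the query in the cleaning cell
kept = prices[prices <= PRICE_CAP]
print(kept.tolist())
```

Passing regex=False explicitly avoids depending on the default value of the regex argument, which differs between pandas versions.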
In [5]:
airbnb_cleaned.describe()
Out[5]:
       accommodates        price         beds  number_of_reviews  review_scores_rating     latitude    longitude
count   4286.000000  4286.000000  4286.000000        4286.000000           4286.000000  4286.000000  4286.000000
mean       3.588661   225.643724     1.949837          51.077462              4.769606    49.262193  -123.110515
std        2.030560   166.831182     1.151349          78.005589              0.389521     0.020632     0.038710
min        1.000000    30.000000     1.000000           1.000000              0.000000    49.202960  -123.214820
25%        2.000000   124.000000     1.000000           5.000000              4.720000    49.249278  -123.130748
50%        3.000000   181.000000     2.000000          20.000000              4.880000    49.267743  -123.111145
75%        4.000000   271.000000     2.000000          64.750000              5.000000    49.278884  -123.086525
max       16.000000  2257.000000    13.000000         916.000000              5.000000    49.294360  -123.023903
In [6]:
#Finally I will export the cleaned dataframe to csv so that my group members can use it
airbnb_cleaned.to_csv('cleaned_airbnb_data_final.csv', index=False)

Task 1: Exploring how the daily rate of the AirBnB varies with the number of beds, bathrooms, and people that can be accommodated.

In [7]:
import altair as alt
import geojson
# Handle large data sets without embedding them in the notebook
#alt.data_transformers.enable('data_server')
import vegafusion as vf
vf.enable_widget()
# Default Rendering
alt.renderers.enable('default')
Out[7]:
RendererRegistry.enable('default')
In [8]:
slider2 = alt.binding_range(min=0.1, max = 1.0, step=0.1, name='Opacity:')

op_opacity = alt.param(value = 0.7, bind=slider2)

brush_select = alt.selection_interval(encodings = ['x'], empty = False)

task_1 = alt.Chart(airbnb_cleaned, width = 400, height = 250, title = "Task 1: Daily Price Vs. Number of Beds, Bathrooms, People Accommodated").mark_circle(opacity=op_opacity, stroke='black', strokeWidth=0.5).encode(
    x = alt.X("beds:Q", axis=alt.Axis(grid=False, ticks=False)).title("Number of Beds"),
    y = alt.Y("price:Q", axis=alt.Axis(grid=False, ticks=False)).title("Price Per Day"),
    size = alt.Size("bathrooms:Q").title("Number of Bathrooms"),
    color = alt.condition(brush_select, 'accommodates:Q', alt.value('lightgray')),
    tooltip = alt.Tooltip(['price:Q', 'beds:Q', 'bathrooms:Q'])
).add_params(
    op_opacity,
    brush_select
)

task_1
Out[8]:

The plot above shows how the daily price changes with the number of beds (x-axis) and bathrooms (point size). Brushing an interval along the x-axis colours the selected points by the number of people accommodated, and the opacity slider makes overlapping points easier to distinguish.

In [9]:
accomm_chart = alt.Chart(airbnb_cleaned, width = 400, height = 250, title = "Daily Price Vs. Number of People Accommodated").mark_circle(size=150, stroke='black', strokeWidth=1).encode(
    x = alt.X("accommodates:Q", axis=alt.Axis(grid=False, ticks=False), scale=alt.Scale(zero=False)).title("Number of People Accommodated"),
    y = alt.Y("price:Q", axis=alt.Axis(grid=False, ticks=False), scale=alt.Scale(zero=False)).title("Price Per Day"),
    tooltip = alt.Tooltip(['price:Q', 'accommodates:Q'])
)

accomm_chart
Out[9]:

The plot above shows how the price varies with the number of people that can be accommodated.

In [10]:
task_1_vis = accomm_chart.encode(color = alt.condition(brush_select, alt.value('#4ba670'), alt.value('lightgray'))).add_params(brush_select) | task_1

task_1_vis
Out[10]:

Now we have put both plots together to address the task. There is bidirectional linking between them via a selection interval, which shows the number of people that can be accommodated alongside the corresponding number of beds and bathrooms available for each listing.

Task 2: Exploring how the property type and room type of the AirBnB affects the review ratings.

In [11]:
brush = alt.selection_interval(encodings = ['y'], empty = True)

selection = alt.selection_point(fields=['room_type'])
color = alt.condition(
    selection,
    alt.Color('room_type:N').legend(None),
    alt.value('lightgray')
)

tick_plot = alt.Chart(airbnb_cleaned, title = "Task 2: Exploring How Review Ratings Vary With Different Property and Room Types").mark_tick(size = 20).encode(
    x = alt.X('property_type:N', title="Property Type",axis=alt.Axis(labelAngle=-45)),
    y = alt.Y('review_scores_rating', title= "Review Scores"),
    color = color
).properties(
    width = 400,
    height = 250
).add_params(brush)

legend = alt.Chart(airbnb_cleaned).mark_point().encode(
    alt.Y('room_type:N', axis=alt.Axis(orient='right')),
    color=color
).add_params(
    selection
)

bars = alt.Chart(airbnb_cleaned, title = "Count of Combination of Property & Room Types").mark_bar().encode(
    x= alt.X('count():Q', title = "Counts of the combination"),
    y= alt.Y('property_type:N', title = "Property Type"),
    color='room_type:N',
    tooltip=['property_type:N', 'count():Q', 'room_type:N']
).properties(
    width=400,
    height=250
).transform_filter(
    brush
)

task_2_vis = tick_plot|legend|bars

task_2_vis
Out[11]:

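The tick plot on the left addresses Task 2; clicking a room type in the legend filters it by room type. The bar chart on the right shows the count of listings for each combination of property type and room type, restricted to any range of review scores brushed on the tick plot.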
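
The counts behind the bar chart can also be tabulated directly in pandas, which is a quick way to sanity-check the chart. This is a minimal sketch on made-up rows; toy and its values are hypothetical, and on the real data the same call would be applied to airbnb_cleaned.

```python
import pandas as pd

# Hypothetical listings, not taken from the cleaned dataset
toy = pd.DataFrame({
    "property_type": ["Condo", "Condo", "House", "House", "House"],
    "room_type": ["Entire place", "Private room", "Entire place",
                  "Entire place", "Private room"],
})

# Rows are property types, columns are room types, cells are listing counts
counts = pd.crosstab(toy["property_type"], toy["room_type"])
print(counts)
```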

Task 3: Exploring how the AirBnB host’s response time as well as whether they are a designated superhost or not affects their review ratings & number of reviews.

In [12]:
selection = alt.selection_point(fields=['host_is_superhost', 'host_response_time'])
color = alt.condition(
    selection,
    alt.Color('host_is_superhost:N').legend(None),
    alt.value('lightgray')
)

scatter = alt.Chart(airbnb_cleaned, title = "Task 3: How Review Ratings Vary With Number of Reviews, Host Response Time, & Superhost Status").mark_point(size=88).encode(
    x='number_of_reviews:Q',
    y='review_scores_rating:Q',
    color=color,
    tooltip = ['number_of_reviews:Q', 'review_scores_rating:Q', 'host_response_time:N', 'host_is_superhost:N']
).properties(
    width = 400,
    height = 250
)

legend = alt.Chart(airbnb_cleaned, title= "Filter Combo of Host Response Time & Superhost Status").mark_rect().encode(
    alt.Y('host_is_superhost').axis(orient='right'),
    x=alt.X('host_response_time',axis=alt.Axis(labelAngle=-45)),
    tooltip = ['host_response_time:N', alt.Tooltip('count():Q')],
    color=color
).add_params(
    selection
).properties(
    width = 400,
    height = 100
)

task_3_vis = scatter | legend
task_3_vis
Out[12]:

Here, clicking a combination of host response time and superhost status in the filter on the right highlights the matching AirBnBs on the left, where each point shows a listing's number of reviews and review rating.

Task 4: Exploring how the daily rate of the AirBnB varies with the neighbourhood location and room type.

First we needed to get the geographical map of Vancouver. We found the relevant geojson file here: https://github.com/blackmad/neighborhoods/blob/master/vancouver.geojson

In [13]:
# the code is adapted from https://stackoverflow.com/questions/74168389/can-mark-geoshape-be-used-for-canadian-provinces-cities
can_prov_file = 'vancouver.geojson'
with open(can_prov_file) as f:
    var_geojson = geojson.load(f)
data_geojson = alt.InlineData(values=var_geojson, format=alt.DataFormat(property='features',type='json'))

# chart object
vancouver = alt.Chart(data_geojson).mark_geoshape(fill='lightgray',
    stroke='white'
).project(
    type='identity', reflectY=True
).properties(height=300, width = 800) 
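
The property='features' argument in the cell above points Altair at the list of features inside the GeoJSON FeatureCollection. The structure can be inspected with the standard json module. This is only a sketch on a made-up, heavily simplified stand-in for vancouver.geojson; in particular, the "name" property key is an assumption and may differ in the real file.

```python
import json

# Hypothetical one-feature stand-in for vancouver.geojson
sample = """
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature",
    "properties": {"name": "West End"},
    "geometry": {"type": "Polygon",
                 "coordinates": [[[-123.14, 49.28], [-123.12, 49.28],
                                  [-123.12, 49.29], [-123.14, 49.28]]]}}
 ]}
"""

collection = json.loads(sample)
features = collection["features"]  # one entry per neighbourhood polygon

# "name" is an assumed property key; .get() returns None if it is absent
names = [feature["properties"].get("name") for feature in features]
print(names)
```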
In [14]:
points = alt.Chart(airbnb_cleaned).mark_circle(size=30,opacity=0.8).encode(
    latitude='latitude:Q',
    longitude='longitude:Q',
    color=alt.Color('review_scores_rating', scale = alt.Scale(scheme='plasma',domain=[5,4]),
                   legend = alt.Legend(orient = 'bottom')).title('Review Rating'),
    tooltip=[alt.Tooltip('review_scores_rating', title='Review Rating'), alt.Tooltip('neighbourhood_cleansed', title='Neighbourhood')]
)



vancouver_map = vancouver + points

#vancouver_map
In [15]:
genres = ['Entire place', 'Private room']

room_type_dropdown = alt.binding_select(options=genres, name="Room Type")
room_type_select = alt.selection_point(fields=['room_type'], bind=room_type_dropdown)

filter_genres = points.add_params(
    room_type_select
).transform_filter(
    room_type_select
).properties(title="Task 4: How Airbnb Ratings Vary With Neighbourhood & Room Type")

map_review_rating = vancouver + filter_genres
map_review_rating
Out[15]:

The map above shows the review ratings for each AirBnB in Vancouver. There is a dropdown filter to show the review ratings for a single room type.

In [16]:
heat_map = alt.Chart(airbnb_cleaned).mark_rect().encode(
    color = alt.Color('mean(review_scores_rating)',scale = alt.Scale(scheme='plasma',domain = [5,4]), legend = None),
    x = alt.X('neighbourhood_cleansed', axis=alt.Axis(labelAngle=-45)).title('Neighbourhood'),
    y = alt.Y('room_type').title('Room Type'),
    tooltip=alt.Tooltip(['mean(review_scores_rating)'], format='.2f')
).properties(height = 60, width = 800)
heat_map
Out[16]:

The heat map shown above displays the average review rating for each combination of room type and neighbourhood.
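
The aggregation that the heat map encodes with mean(review_scores_rating) can be reproduced in pandas to check individual cells. This is a minimal sketch on made-up rows; toy and its values are hypothetical, and only the column names match the cleaned dataset.

```python
import pandas as pd

# Hypothetical listings, not taken from the cleaned dataset
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["West End", "West End", "Riley Park", "Riley Park"],
    "room_type": ["Entire place", "Entire place", "Private room", "Entire place"],
    "review_scores_rating": [4.6, 5.0, 4.8, 4.9],
})

# One row per room type, one column per neighbourhood, cells hold the mean rating;
# combinations with no listings become NaN
mean_rating = (
    toy.groupby(["room_type", "neighbourhood_cleansed"])["review_scores_rating"]
    .mean()
    .unstack()
)
print(mean_rating.round(2))
```

Combinations with no listings show up as NaN here, which corresponds to an empty cell in the heat map.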

In [17]:
task_4_vis = alt.vconcat(map_review_rating, heat_map)
task_4_vis
Out[17]:

The final image above combines both visualizations to answer Task 4.

Final Dashboard

Here we put all of our visualizations together:

In [18]:
# dashboard = alt.vconcat(task_1_vis, task_3_vis, task_4_vis, task_2_vis)

# dashboard

display(task_1_vis)
display(task_2_vis)
display(task_3_vis)
display(map_review_rating)
display(heat_map)

*Please note that we did not use the alt.vconcat() or alt.hconcat() methods as that caused our sliders, buttons, and filters to randomly migrate to the bottom.